Everything in New York is expensive. For first-time travelers, New York may seem even more expensive. At the same time, travelers have different wants and needs from their accommodation than a student or a working person would. So I wanted to analyze the price trends of Airbnb listings in New York through the eyes of a traveler.
Travelers of different budgets and purposes have different priorities, but most would definitely prefer good accessibility to the top tourist attractions they want to visit. Will this have an effect on Airbnb rental prices?
For this data analysis, I used the Airbnb open data available here. I used the listings.csv file for New York.
Since the csv file contained more than 20,000 entries, I decided to do some basic scrubbing first and then export the result to a different csv using the csv library. I then used the pandas library to manipulate and display selected data, and the matplotlib and seaborn libraries for visualization. To calculate the average distance from each listing to the top-rated tourist attractions of New York, I used the Beautiful Soup library to parse a TripAdvisor page and retrieve a list of attraction names. I then used the Google Places API to get each attraction's latitude and longitude, so that I could calculate the great-circle distance from each Airbnb apartment.
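Before diving in, here is a minimal sketch of the distance calculation at the core of the analysis: geopy's great_circle computes the great-circle distance between two (latitude, longitude) tuples. The coordinates below are approximate, illustrative values for two well-known Manhattan spots, not values taken from the dataset.
In [ ]:
from geopy.distance import great_circle

# illustrative coordinates (approximate): Central Park and Times Square
central_park = (40.7829, -73.9654)
times_square = (40.7580, -73.9855)

# great-circle distance in kilometers, rounded the same way as in clean_csv() below
dist_km = round(great_circle(central_park, times_square).kilometers, 2)
print(dist_km)  # roughly 3 km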
In [3]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt
import numpy as np
import seaborn as sns
import statistics
import csv
from scipy import stats
from bs4 import BeautifulSoup as bs
import urllib.request
from googleplaces import GooglePlaces, types, lang
from geopy.distance import great_circle
import geocoder
%matplotlib inline
print('Python version: ', sys.version)
print('Pandas version: ', pd.__version__)
print('Today: ', dt.date.today())
In [4]:
apikey = 'YOUR_GOOGLE_PLACES_API_KEY'  # use your own key here; never publish a real API key
gplaces = GooglePlaces(apikey)
This function takes two parameters: the URL of the TripAdvisor page and the number of top attractions one wants to check. It uses the Beautiful Soup library to find the div that contains the list of top-rated tourist attractions in the city and returns them as a list.
In [5]:
def tripadvisor_attractions(url, how_many):
    page = urllib.request.urlopen(url)
    # use Beautiful Soup to select the targeted div
    soup = bs(page.read(), "lxml")
    filtered = soup.find("div", {"id": "FILTERED_LIST"})
    top_list = filtered.find_all("div", class_="property_title")
    sites = []
    # save the text within each hyperlink into the list
    for site in top_list:
        site = (site.a).text
        site = str(site)
        if not any(char.isdigit() for char in site):
            sites.append(site)
    # slice the list down to however many places the user wants to include
    sites = sites[:how_many]
    return sites
This function takes the list returned by the tripadvisor_attractions() function as well as the city name as a string. I explicitly ask for the city name so that the Google Places API finds more accurate place details when it looks up each tourist attraction. It returns a dataframe of each tourist attraction, its Google place ID, longitude, and latitude.
In [6]:
# ta is short for tourist attraction
def ta_detail(ta_list, city):
    ta_df = pd.DataFrame({'Tourist Attraction': '',
                          'place_id': '',
                          'longitude': '',
                          'latitude': ''},
                         index=range(len(ta_list)))
    for i in range(len(ta_list)):
        query_result = gplaces.nearby_search(
            location=city,
            keyword=ta_list[i],
            radius=20000)
        # keep only the first (top) result of each query
        query = query_result.places[0]
        ta_df.loc[i, 'Tourist Attraction'] = query.name
        ta_df.loc[i, 'longitude'] = query.geo_location['lng']
        ta_df.loc[i, 'latitude'] = query.geo_location['lat']
        ta_df.loc[i, 'place_id'] = query.place_id
    return ta_df
In [7]:
def latlong_tuple(ta_df):
    tuple_list = []
    for j, ta in ta_df.iterrows():
        ta_geo = (float(ta['latitude']), float(ta['longitude']))
        tuple_list.append(ta_geo)
    return tuple_list
This function is the main data-scrubbing function. I first tried importing the csv as a dataframe and then cleaning each entry, but pandas iterrows() and itertuples() took a very long time, so I decided to do the basic scrubbing while importing the csv. The function automatically saves a new copy of the cleaned csv with an _out.csv suffix; the function itself doesn't return anything.
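As an aside, here is a minimal sketch of the row-iteration cost mentioned above, using a small synthetic frame rather than the real listings data; actual timings will vary by machine and data size.
In [ ]:
import time
import pandas as pd

# synthetic stand-in for the listings frame
sample = pd.DataFrame({'lat': [40.75] * 100000, 'lng': [-73.98] * 100000})

start = time.time()
for _, r in sample.iterrows():            # row-wise iteration builds a Series per row
    _ = (r['lat'], r['lng'])
print('iterrows:   %.2f s' % (time.time() - start))

start = time.time()
for r in sample.itertuples(index=False):  # namedtuples are much cheaper per row
    _ = (r.lat, r.lng)
print('itertuples: %.2f s' % (time.time() - start))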
In [8]:
def clean_csv(data_in, geo_tuples):
    # automatically generates a cleaned csv file with the same name plus an _out.csv suffix
    index = data_in.find('.csv')
    data_out = data_in[:index] + '_out' + data_in[index:]
    # some error checking when opening
    try:
        s = open(data_in, 'r')
    except:
        print('File not found or cannot be opened')
    else:
        t = open(data_out, 'w', newline='')  # newline='' is the documented idiom for csv.writer
        print('\n Output from an iterable object created from the csv file')
        reader = csv.reader(s)
        writer = csv.writer(t, delimiter=',')
        # counters for the number of rows kept and removed during filtering
        removed = 0
        added = 0
        header = True
        for row in reader:
            if header:
                header = False
                for i in range(len(row)):
                    # save the indices of the latitude and longitude columns
                    if row[i] == 'latitude':
                        lat = i
                    elif row[i] == 'longitude':
                        lng = i
                row.append('avg_dist')
                writer.writerow(row)
            # only keep the listing if its last column (the review count in this dataset) is greater than 7
            elif (int(row[-1]) > 7):
                # create a geo tuple for easy distance calculation later on
                tlat = float(row[lat])
                tlng = float(row[lng])
                ttuple = (tlat, tlng)
                dist_calc = []
                # calculate the distance from the listing to every saved top tourist attraction;
                # if a distance is somehow greater than 100 km, skip it so it doesn't skew the result
                for i in geo_tuples:
                    dist_from_spot = round(great_circle(i, ttuple).kilometers, 2)
                    if (dist_from_spot < 100):
                        dist_calc.append(dist_from_spot)
                    else:
                        print('Attraction at', i, 'is too far.')
                # average distance between the listing and all of the tourist attractions
                avg_dist = round(statistics.mean(dist_calc), 3)
                row.append(avg_dist)
                writer.writerow(row)
                added += 1
            else:
                removed += 1
        s.close()
        t.close()
        print('Function Finished')
        print(added, 'listings saved')
        print(removed, 'listings removed')
In [9]:
url = "https://www.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html"
top_10 = tripadvisor_attractions(url, 10)
In [10]:
print(top_10)
In [11]:
ta_df = ta_detail(top_10, 'New York, NY')
geo_tuples = latlong_tuple(ta_df)
ta_df
Out[11]:
The cell below reads in the original csv file, removes some unwanted listings, and adds a new column holding the average distance from the top 10 TripAdvisor-approved (!!) tourist attractions.
In [22]:
clean_csv("data/listings.csv", geo_tuples)
We then make a copy of the dataframe, listing, to play around with.
In [24]:
df = pd.read_csv('data/listings_out.csv')
print('Dimensions:', df.shape)
df.head()
listing = df.copy()
In [25]:
listing.head()
Out[25]:
In [43]:
area = listing.groupby('neighbourhood_group')
nbhood_price = area['price'].agg([np.sum, np.mean, np.std])
nbhood_dist = area['avg_dist'].agg([np.sum, np.mean, np.std])
In [44]:
fig, ax = plt.subplots(nrows=2, ncols=1, sharex=True)
fig.suptitle('NY Neighbourhoods: Price vs Average Distance to Top Spots', fontsize=10, fontweight='bold')
nbhood_price['mean'].plot(kind='bar', ax=ax[0], color='mediumslateblue')
nbhood_dist['mean'].plot(kind='bar', ax=ax[1], color = 'orchid')
ax[0].set_ylabel('Price', fontsize=10)
ax[1].set_ylabel('Average Distance', fontsize=10)
Out[44]:
Then I used the groupby function on neighbourhood to see a price comparison between different New York neighbourhoods.
In [45]:
area2 = listing.groupby('neighbourhood')
nb_price = area2['price'].agg([np.sum, np.mean, np.std]).sort_values(['mean'])
nb_dist = area2['avg_dist'].agg([np.sum, np.mean, np.std])
In [46]:
fig, ax = plt.subplots(figsize=(4, 35))
fig.suptitle('Most Expensive Neighbourhoods on Airbnb', fontsize=10, fontweight='bold')
nb_price['mean'].plot(kind='barh', ax=ax, color='salmon')
Out[46]:
In [47]:
breezy = listing.loc[listing['neighbourhood'] == 'Breezy Point']
breezy
Out[47]:
The second most expensive neighbourhood is also not in Manhattan, in contrast to the first visualization we did, which showed that Manhattan has the highest average Airbnb price. All apartments in Manhattan Beach turn out to be reasonably priced except "Manhattan Beach for summer rent", which costs 2,800 USD per night.
It seems that outliers are skewing the data quite significantly.
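One quick way to see how much the outliers pull on the averages (a sketch against the listing frame built above): compare the mean price with the median, which is robust to extreme listings.
In [ ]:
# mean vs median price per neighbourhood group; a large gap hints at outliers
skew_check = listing.groupby('neighbourhood_group')['price'].agg(['mean', 'median'])
skew_check['mean_minus_median'] = skew_check['mean'] - skew_check['median']
skew_check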
In [48]:
beach = listing.loc[listing['neighbourhood'] == 'Manhattan Beach']
beach
Out[48]:
In [49]:
area = listing.groupby('room_type')
room_price = area['price'].agg([np.sum, np.mean, np.std])
room_dist = area['avg_dist'].agg([np.sum, np.mean, np.std])
In [50]:
room_price['mean'].plot(title="Average Price by Room Type")
Out[50]:
In [51]:
apt = listing.loc[listing['room_type'] == 'Entire home/apt']
apt = apt.sort_values('price', ascending=False)
apt.drop(apt.head(20).index, inplace=True)
apt.head()
Out[51]:
In [52]:
sns.jointplot(x='avg_dist', y="price", data=apt, kind='kde')
Out[52]:
Plotting the Entire home/apt listings without the top 20 most expensive ones shows two concentrated areas where average distance and price correlate. The bimodal distribution in average distance might reflect the concentration of Airbnb listings in Manhattan and Brooklyn.
In [36]:
f, ax = plt.subplots(figsize=(11, 6))
sns.violinplot(x="neighbourhood_group", y="price", data=apt, palette="Set3")
Out[36]:
Plotting a violin diagram of the prices of all entire homes in different neighbourhood groups shows that Manhattan has a more widely distributed price range, albeit on the higher end, while Queens and the Bronx have listings concentrated at a specific, lower price point.
To deal with some of the outliers at the top, I tried deleting the 10 or 20 most expensive listings, but this method wasn't very scalable across the dataset, nor was it an accurate depiction of the price variety. So I decided to first get an understanding of the most expensive listings in New York, and then to create a separate dataframe that removes entries with prices more than three standard deviations from the mean.
In [53]:
fancy = listing.sort_values('price', ascending=False).iloc[:50]
fancy.head(10)
Out[53]:
In [54]:
fancy.describe()
Out[54]:
It is likely that some of the listings above are meant specifically for events and photography rather than traveler accommodation. It also seems that some hosts who weren't available to host, but didn't want to remove their listing from Airbnb, simply set the price to 9,900 USD.
Some of the listings that seemed "normal" still had very high prices. To limit their influence, the cell below keeps only listings with more than one review and then drops entries priced more than three standard deviations from the mean:
In [55]:
reviewed = listing.loc[listing['number_of_reviews'] > 1]
# keep only prices within three standard deviations of the mean
reviewed = reviewed[((reviewed['price'] - reviewed['price'].mean()) / reviewed['price'].std()).abs() < 3]
reviewed.describe()
Out[55]:
In [56]:
fig, axs = plt.subplots(1, 2, sharey=True)
fig.suptitle('Do Reviews and Price Matter?', fontsize=20, fontweight='bold')
reviewed.plot(kind='scatter', x='reviews_per_month', y='price', ax=axs[0], figsize=(16, 8))
reviewed.plot(kind='scatter', x='avg_dist', y='price', ax=axs[1])
Out[56]:
The two plots above explore whether there is any relationship between price and the number of reviews per month (a proxy for trust and approval), and between price and the average distance from the top attractions. The reviews-per-month plot does not display any positive correlation between price and user approval, which makes sense, as many factors beyond user approval determine an apartment's rental price.
The average distance plot shows an interesting negative correlation: the lower the average distance, the higher the price tends to be.
Both graphs show that many hosts like to set prices in increments of 5 or 10 USD, as the data is heavily concentrated along the horizontal grid lines.
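To put a rough number on these visual impressions, here is a quick check using scipy (already imported above). It assumes the reviewed frame built earlier; filling missing review rates with 0 is an assumption of this sketch, not part of the original analysis.
In [ ]:
# Pearson correlation on the outlier-trimmed frame
r_dist, p_dist = stats.pearsonr(reviewed['avg_dist'], reviewed['price'])
r_rev, p_rev = stats.pearsonr(reviewed['reviews_per_month'].fillna(0), reviewed['price'])
print('avg_dist vs price:          r = %.3f (p = %.3g)' % (r_dist, p_dist))
print('reviews_per_month vs price: r = %.3f (p = %.3g)' % (r_rev, p_rev))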
In [57]:
f, ax = plt.subplots(figsize=(11, 5))
sns.boxplot(x="neighbourhood_group", y="price", hue="room_type", data=reviewed, palette="PRGn")
Out[57]:
The box plot above shows how large the discrepancy in Manhattan apartment prices is. The top 25% of apartments in Manhattan range in price from 400 USD to more than 700 USD, while those in the Bronx span a range of just 200 to 300 USD.
In [58]:
reviewed2 = reviewed[((reviewed['price'] - reviewed['price'].mean()) / reviewed['price'].std()).abs() < 2]
sns.jointplot(x='avg_dist', y="price", data=reviewed2, kind='kde')
Out[58]:
For a better visualization of the correlation between price and average distance, I plotted another graph with only about 95% of the dataset (i.e., listings within two standard deviations of the mean). This joint plot shows two highly concentrated clusters: apartments around 5 km from the top tourist attractions on average, at around 90-100 USD per night, and apartments around 8 km away at 50-60 USD per night.
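As a sanity check on the "about 95%" figure, the two-standard-deviation filter should keep roughly that share of rows (a one-liner against the frames defined above):
In [ ]:
# fraction of listings kept by the 2-sigma price filter
print('kept %.1f%% of reviewed listings' % (100 * len(reviewed2) / len(reviewed)))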
By looking at several visualizations of the Airbnb data in New York, I was able to find some negative correlation between price per night and the average distance from the most famous sights. Data grouped by neighbourhood yielded the expected result: the highest average price per night in Manhattan and the lowest in the Bronx and Queens. The listings.csv data I used contained only a summary of the data, so it was not easy to analyze the detailed factors that determine the price. Moving forward, however, I would love to analyze the detailed version of the open data to identify more accurate correlations between price and apartment size, availability, reviews, average distance, and so on.